Identification of Chemical Entities in Patent Documents

Authors

  • Tiago Grego
  • Piotr Pezik
  • Francisco M. Couto
  • Dietrich Rebholz-Schuhmann
Abstract

Biomedical literature is an important source of information on chemical compounds. However, chemical entities are written in many different representations and nomenclatures, which makes references to them ambiguous. Many systems already exist for gene and protein entity recognition, but very few exist for chemical entities. The main reason is the lack of corpora for training named entity recognition systems and evaluating them. In this paper we present a chemical entity recognizer that uses a machine learning approach based on conditional random fields (CRF), and we compare its performance with dictionary-based approaches that use several terminological resources. A gold standard of manually curated patent documents was used for training and evaluation. While the dictionary-based systems perform well at partial identification of chemical entities, the machine learning approach performs better when identifying complete entities (a 10% increase in F-score over the best dictionary-based system).
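
The distinction between partial and complete identification can be made concrete with a small scoring sketch. The abstract does not state the matching criteria the authors used, so the code below assumes that a partial match means any character overlap between a predicted span and a gold span, and that a complete match means identical offsets. The function names and the toy spans are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def overlaps(a: Span, b: Span) -> bool:
    """Return True if the two spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def f_score(gold: List[Span], predicted: List[Span], partial: bool) -> float:
    """F-score under exact matching, or under overlap-based partial matching.

    Assumption: "partial" means any overlap with a gold span; the paper's
    actual criterion is not given in the abstract.
    """
    if partial:
        matched_pred = sum(any(overlaps(p, g) for g in gold) for p in predicted)
        matched_gold = sum(any(overlaps(g, p) for p in predicted) for g in gold)
    else:
        gold_set, pred_set = set(gold), set(predicted)
        matched_pred = sum(p in gold_set for p in predicted)
        matched_gold = sum(g in pred_set for g in gold)
    precision = matched_pred / len(predicted) if predicted else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy spans, invented for illustration: the tagger finds the second entity
# exactly but only the first few characters of the first one.
gold = [(0, 15), (20, 27)]
predicted = [(0, 6), (20, 27)]

exact_f = f_score(gold, predicted, partial=False)
partial_f = f_score(gold, predicted, partial=True)
print(f"exact F-score:   {exact_f:.2f}")
print(f"partial F-score: {partial_f:.2f}")
```

On this toy input a tagger that truncates one entity name scores perfectly under the overlap criterion but loses half its score under exact matching, which is the kind of gap the abstract describes between dictionary-based and CRF-based systems.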

Related resources

Identification of BKCa channel openers by molecular field alignment and patent data-driven analysis

In this work, we present the first comprehensive molecular field analysis of patent structures on how the chemical structure of drugs impacts the biological binding. This task was formulated as searching for drug structures to reveal shared effects of substitutions across a common scaffold and the chemical features that may be responsible. We used the SureChEMBL patent database, which prov...

TREC Chemical IR Track 2009: A Distributed Dimensional Indexing Model for Chemical Patent Search

For the TREC-2009 Chemical IR Track, we explore development of a distributed information retrieval system based on a dimensional data model. The indexing model supports named entity identification and aggregation of term statistics at multiple levels of patent structure including individual words, sentences, claims, descriptions, abstracts, and titles. The system was deployed across 15 Amazon W...

Chemical and Biological Entity Recognition System from Patent Documents

It is crucial to explore the chemical and biological space covered by patent documents. In order to recognize chemical and biological entities, a recognition system is developed on the basis of open-source machine learning and natural language processing (NLP) toolkits. The system processing pipeline consists of three major components: pre-processing (sentence detection, tokenization), recognit...

Semantic-Web Access to Patent Annotations

SureChEMBL (https://www.surechembl.org) is a patent chemistry resource, originally a commercial product developed by SureChem/Digital Science, and recently made freely available at EMBL-EBI [1]. SureChEMBL uses a live and fully automated cloud-based pipeline that combines text-mining and chemistry tools to extract compounds named or depicted in patent documents and make them readily structure s...

Comparing manual and automated extraction of chemical entities from documents

The chemical information landscape is changing rapidly with a yearly increase of over 1 million new compounds and more than 700,000 publications related to chemistry [1]. Exploring the chemical space covered by relevant journals and patents is a crucial step in early stage medicinal chemistry projects. Extracting chemical entities from unstructured text is a complex task and different approache...

Publication date: 2009